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(g) Integrated circuit for maximum likelihood estimation of transmitted digital data. 



(57) In modern data transmission systems the 
main reasons for the unreliable reception of 
data is the InterSymbol Interference (ISI) t the 
noise together with the errors (in the demodu- 
lation stage and the timing of the samples). 
There is, however, a maximum likelihood re- 
ceiver that minimises these effects in an 
optimum way, using dedicated algorithms. One 
well known algorithm which eliminates ISI, due 
to multi-path propagation, in mobile communi- 
cation receivers is the Viterbi algorithm that 
determines the real transmitted sequence, by 
observing the data for a long time interval. 

The Viterbi algorithm can be used not only for 
the decoding of convolutional code sequences 
but also for the estimation of the maximum 
likelihood sequence of data distorted by ISI. 

The current patent application concerns an 
integrated circuit that executes a properly mod- 
ified Viterbi algorithm using parallel architec- 
ture. Implementations of the Viterbi algorithm 
using commercial DSPs or ASICs have been 
reported in the past. However they do not em- 
ploy a parallel architecture and thus, require 
much longer processing times. 

The main use of the proposed invention is in 
the Maximum Likelihood Sequence Estimation 
(MLSE) of digital data sequences, distorted by 
noise attributed to ISI. The same architecture, 
however, with minor modifications, can also be 
used to decode digital data coded with con- 
volutional codes. 
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In modern data transmission systems, the main reasons for the unreliable reception are the Intersymbol 
Interference (ISI), the noise, together with errors at the phase of the demodulating carrier and the timing of 
the sampling. 

At the Digital Communications and particularly at the transmission of digital data the following phenomena 
5 often appears: 

The input sequence Ik of discrete ti me, is used to modulate a continuous waveform that will be transmitted 
through a channel. During the sampling at the receiver the perfect samples Zk should be identical to the cor- 
responding ones transmitted, which, however, is not true because they are affected by the "noise" from the 
neighbourhood samples. The latter phenomenon is called InterSymbo! interference and causes the correlation 
10 among consecutive symbols of the data sequence. 

There is, however, a receiver that minimises all these effects with a maximum likelihood technique using 
dedicated algorithms. 

One of these algorithms, which is actually used at the receivers at mobile systems to eliminate the ISI ef- 
fects caused by the multipath propagation is the Viterbi algorithm. 

A Maximum Likelihood Sequence Receiver would function in an adaptive mode so as to follow the time 
variations of the transmission channel in applications like the mobile telephony. 

We assume that the transmitter transmits periodically sequences at time intervals called time slots. 
The characteristics of the time slots (Figure 0) are the following: 

Every slot consists of the left and right part of 60 bits each. The 3 first and the 3 last bits are 0 while the 
rest 57+57 are information. At the centre of the slot exists the preamble that is 26 bits of which the 5 first and 
the 5 last are known and are used for the initialisation of the Viterbi algorithm. Finally the 16 bits at the centre 
are information regarding the transmission channel response. 

The communication system employs TDMA (Time Division Multiple Access). A total of 8 subscribers may 
share the channel and transmit every 5ms a slot, i.e. to have total 8 slots (frame) of duration 4,57ms, and bit- 
25 rate 270kb/s with duration of every slot 0.571 ms. 

Because of the previously mentioned interference of the bits, a good strategy to estimate the true trans- 
mitted data sequence, should be based on the observation of a long received sequence, and not on a symbol- 
by-symbol basis. 

Therefore the feature of the Maximum Likelihood Sequence Estimation seems to be an appropriate sol- 
ution. The Viterbi algorithm performing MLSE may be used not only for the decoding of data coded with con- 
volution^ codes but also for the estimation of the maximum likelihood sequence of data received with distortion 
because of the ISI. 

Assuming n is the number of bits for which there exists InterSymbol Interference. If n=5, most of the sit- 
uations that may appear in real operation. conditions may be resolved. The total number of the possible trans- 
mitted sequences are 32 (all possible combinations of the 5 bits), which are called states. 

Every state is characterised by the decimal number which represents the binary sequence of the 5 bits. 
The received signal by the receiver is sampled, and, for every transmitted bit, corresponds an 8-bit value which 
comprises the input data for the algorithm. 

The Viterbi processor performs a step of the algorithm every time it receives a sample. The number of 
steps is equal to the number of bits which modulated the carrier to form a slot. A probability is assigned to every 
one of the 32 possible sequences of 5 bits. Every state is possible to have two predecessors, i.e. the state 
10010 is possible to have as predecessors the 01001 or the 11001. At every step for every state a, i=0...31 
the predecessor state has to be determined. Therefore two probabilities (one for each predecessor) are cal- 
culated, the so called "metrics" are evaluated and the maximum of those metrics is selected. 

It is obvious that at every state except of the corresponding probability there is also a history of "survivor 
paths" that keeps track of the transitions through all the steps of the algorithm. There are as many survivor 
paths as there are states. It is expected that after 15 steps the algorithm will converge and the survivor paths 
tend to have a common part that is called "common history path". The latter is the maximum likelihood estimate 
of the transmitted sequence determined by the algorithm. 

This common history path has occured for previous data and not the current ones. 
Thus, the algorithm makes a decision for the samples received 15 steps before, resulting in a delay that 
leads to a last step which decides about the last 15 transmitted bits. 

The Architectures that have been presented up to now for the Viterbi algorithm refer to implementations 
for decoding data encoded by convolutional codes. 

In the paper J. Stahl et. al. ("High Speed Viterbi Decoder", Signal Processing III Theories and applications, 
Vol. 2, No. 6, pp. 1117-1120, 1986), the three main parts of the architecture are the unit that produces the tran- 
sition coefficients for each state, the unit that computes the new probabilities of each state (unit add-compare- 
choose) and finally the unit that stores the paths for each state. Each unit works on different data (pipe lining) 
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and uses fixed point arithmetic. However this architecture refers only to decoding. 

In the paper T. Ishitani et. al. ("A scarce-State-Transition Viterbi-decoder VLSI for bit error Correction", 
IEEE J. Solid-State Circuits, Vol. SC-22, No. 4, pp. 575-581 , Aug. 1987) a unit add-compare-choose is descri- 
bed that refers not to the execution of the Viterbi algorithm but to a differentiated view of the SST (Scarce- 

5 State-Transition) with 3-bit received samples. There are 64 states assumed and a path length of 29, while 6 
bit adders with 3 bit block lookahead are used. 

As far as the Viterbi algorithm is concerned for data equalisation the presented implementations are either 
with DSP programming or using simple Von Neuman architecture (Blionas et. al. "ASIC Design with the BACH 
Behavioural Synthesis System", The European Design Automation Conference, Glasgow, Scotland Poster 

10 Session 2, March 1990). 

In the paper A. Balboni et. al. ("A Special Purpose Technology independent SIMD Parallel Processor for 
the Viterbi Algorithm", Proc. of IFIP Workshop on Parallel Architectures on Silicon: from Systolic Arrays to Neu- 
ral Networks, Grenoble France pp. 227-239, December 1989) and in the Ph.D. dissertation of S. V. Blionas 
("VLSI Design for Parallel Digital Signal Processing", Dept. of Electrical Engineering, University of Patras, June 

15 1 990), a design and implementation technique for ICs is shown. This technique uses a technology and vendor 
independent data base and has as a pilot application a Viterbi processor that lacks in speed and parameteri- 
sation to the one presented here. The proposed patent achieves better results because it uses a special control 
unit and a parallel sum unit and is used only for equalisation and not for data decoding. 

In a European patent application (EP0467522), an architecture is proposed for the Viterbi algorithm for 

20 data equalisation and decoding. However, this algorithm is of classical Von Neuman architecture and not par- 
allel, as the proposed invention. Also the EPO398690 application calculates sequentially the algorithm, without 
any parallelism in the processing units for all quantities that the algorithm computes. The lack of parallelism 
disables the system to function in real time, especially in a mobile telephony environment (270 Kbits/s), where, 
it is expected, that the rates will quadruple over the next years. 

25 The said integrated circuit is the only specialised monolithic integrated circuit, having parallel architecture 

(with parallel processors for simultaneous computation of all the algorithm's coefficients) that uses the modi- 
fied Viterbi algorithm for removal of InterSymbol Interference from digital transmitted signals. It is based on 
the channel response and by itself may be used in real time data transmission systems (e.g. mobile telephony), 
even in future systems of up to 6 Mbits/sec. The same chip may decode data coded by convolutional codes. 

30 The high operation speed of the said integrated circuit is because of the parallel architecture for the cal- 

culation not only of the basic algorithm quantities, but also of the computation of certain constants by the dedi- 
cated parallel sum unit, as well as by the control unit that permits operation of the circuit at a speed at least 
quadruple of the one specified by the critical path. 

The above characteristics make the said integrated circuit to be the only one that can manage the needs 

35 of mobile telephony which are estimated to be in the order of a few Mbits/sec. 

The description of the figures that follow simply consist a realisation of the invention as an example without 
constituting any limitation whatsoever. 





Figure 0: 


Description of transmitted time slots. 




Figure 1: 


Data transmission System of linear modulated carrier 
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Figure 2: 


Example of an Imaginary data sequence 




Figure 3: 


SIMD array of processors for the Viterbi Algorithm 




Figure 4: 


Block Diagram of processor 0-7 




Figure 5: 


Block Diagram of processor 8-1 5 




Figure 6: 


Interface Unit 
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Figure 7: 


Read procedure timing 




Figure 8: 


Write procedure timing 




Figure 9: 


Control Logic for the Accumulator 




Figure 10: 


Control Logic for the Real Imaginary Registers 




Figure 11: 


Control Unit 
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Figure 12: 


Hold time for set and reset of a register 




Figure 13: 


Synchroniser circuitry 




Figure 14: 


Synchroniser's operation 




Figure 15: 


COM_DEL down counter circuitry 



Consider a synchronous linear carrier-modulated data transmission system as shown in Fig. 1. Because 
of the combination of the In-phase and Quadrature components, {a n } is a complex data sequence (heavy lines); 
i.e. if the data sequence {a^ 5 ...a n }=111000 and assuming that the a n bit is imaginary, the mathematical rep- 
resentation is 1j1-j-1-j. so for the imaginary part of the signal we have 0=-j and 1-j, while for the real part we 
have 0=-1 and 1=1; 
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The pulse amplitude modulation gives : 

5 n 

(baseband waveform) T=symbol spacing, a n =data, 
f(t) = modulated pulse shape 
x c (t) = x(t)co$QJ carrier modulated waveform 
10 Assuming a transmittion channel with the following characteristics: 

g c (t) = impulse response of the channel 

and 

N(t) = additive white Gaussian noise, * = convolution 
At the receiver the following signal is received: 
« Yc(t) = g c (t) * Re {V2x(t)exp(jcD c t)} + N(t) 

However, the y c (t) spectrum has redundancies that can be eliminated without loss of information, using a 
LP Filter. The output of the LPF is: 

20 y(t)=Z H b (*~nT) + VlNWexpC-jOct-jPc) 

n 

where m^^^M^ )] * f(t) = g(t) * f(t). 
Thus, the overall channel response changes the received waveform and y(t) contains convoluted informa- 
25 tion plus the noise which corresponds to the bandwidth of the low-pass filter (LPF). 

From the received signal y(t), a sequence {A n } must be estimated which corresponds to the original se- 
quence {a n }. 

Suppose that we have some observations {aj, for the interval I of the transmitted sequence. The Maxi- 
mum Likelihood receiver determines as the optimum estimated {A n } ( the one that maximises the Likelihood 
30 function p[yft), tell {a n }] 

Assuming that the noise at the transmission channel is additive-white-Gaussian with zero mean value and 
that the h(t) and the timing of the received signal are well known, then: 

Noise Autocorrelation W(t)=2N 0 K(t) for additive-white-Gaussian noise(AWGN) K(t)=5(t) => 
K(t)*K- 1 (t)=5(t) 
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P*exp{^[£ 2Re(a n ^)-£ ( £ cTj s^tf a k ]} 



aTel iTeI kTeI 



s -l =s l = i J h(t r iT)Y^ (tt-t$h(t<fkT)dtfit2 
l=k-i 



z nt S| are outputs from the matched filter (M F) witrf g MF (t)-/)(-() *'/C- 1 fi) 
N 0 = two-sides spectral density, but 

GmfW = ^ Gmf(0. H(0, and K(f) the Fourier Transform of g MF (t), H(t), and K(t) respectively. 

50 This means that g MF (t) exists if the PSD of the noise is always non zero. The MF maximises SNR but ISI 

still exists. The Viterbi algorithm is used to remove ISI caused by the multipath propagation. The maximum 
likelihood sequence estimator (MLSE) selects the appropriate A n , which maximises the value of the following 
metric: 



J I (M ) = I 2Re(a aZa )~£ £ ° xH-Wl -A- 



4 



EP 0 671 836 A2 



z n contains all relevant information about a n 

z n = Z a n-l s l + r n 

5 > 

s, is assumed to be known, R, = E(r n r n+! ) = 2N 0 s, 
R, = Autocorrelation of the noise samples. 
The proposed MLSE receiver operates in an adaptive mode in order to cope with the time varying mobile 
10 channel. Also, because of the ISI results in a correlation between adjacent symbols in a data sequence, the 
decision has to be based on the observation of a long received sequence, and not on a symbol-by-symbol basis. 

The maximisation of expression Ashould be done by computing all the metrics for all the possible sequenc- 
es and comparing them to decide which is the most probable (which gives the maximum metric). This is diff icult 
because of the length and the number of all the possible sequences. It is possible to use the following expres- 
15 sion and reduce the computational load: 



r n Uan-l»«n) = J n-l(«a n -l )+Rc(a n (2z n -s 0 a n - 2 £ s n . k a k )) 



Assuming that the Intersymbol Interference is restricted among only L consecutive samples that are part 
of a data sequence then: 
s,=0 |l|>L 

25 

L 

J n ( -ctn-l^n) = J n -l( ~ ~,*n-\) + Re[a n ( 2z n- s o a n- 2 S s l<*n-l)J 

7=1 

30 

Notice that if we know the metric of a possible binary sequence { a„. then there are only 2 possible 

sequences {.....an- 1f a n } (a n =0 ( a n =1 ), and we can compute the maximum li kelihood using the previous formulas. 

For every sequence of length L corresponds a state ^i jt and we have 2 L such states. If we know the u.j state 
and sequence {ctj_ k ...aj}, then is uj. k is uniquely determined. So if: 
35 Jn(tfn) = the metric for the path from L to \i n via path a n where a n ={a„_ L+1 ...a n } survivor path 

5 n (<*n) = max (Jn(0trvL» <*n- L+i <*n)) 

i.e. for each state ja n we can find the maximum likelihood sequence of length L and definitely determine the 
state L from which \i n was reached. Supposing that for each state there is one such sequence, we have 2 L 
likely sequences to choose from. 

40 Then, for the next step, for each'state u, n+1f we shall use the metrics of the previous evaluation, i.e. J n for 

the computation of the new metrics. 

It is obvious that the path history will depend on a n but the further one looks back from time n-L, the de- 
pendence of the path history on a n will be weakening, so one can expect a common path history up to some 
time (n-L-m). This is the most likely sequence which we are looking for. 

45 Suppose that we have a 5-bit shift register, which may contain 32 possible numbers (2 5 ). In other words 

these are the 32 states. If a new bit enters from the right side of the shift register we have one of the two new 
possible numbers (depending on the value of the shifted bit, 0 or 1 ). From the 32 states there are 64 transitions 
(example 2). 

In each step each state can be reached from two different states and, the algorithm selects the one of the 
50 two states which assigns the highest metric to the final state. The history path of the selected transition will 
comprise of the history path of the previous state, augmented by the information bit that caused this transition. 
For 15 steps we keep the path history of each state renewing it in every step . After a few steps we can notice 
that the 32 path histories have a common part in the beginning which is called "common path history" and of 
course, it corresponds to a unique binary sequence, viz. the detected one. 
55 The inputs to the Viterbi algorithm are the values St and z n . 

The initial conditions are: 

• L=5 and we will use only the s 1f s 2 , s 3 , s 4 , s 5 . 

• a n s 0 a n =s 0 s 0 is always common for all metrics and so may be ignored. 
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o Dividing all metrics by two (2) : 



[/ n (a n )]/2 = Re(a n z n ) 4- ^ax^ [J ^(0^)72 -Re(a n £ s M ] 

7=1 

thus, the shift for the multiplication by two can be avoided, 
o The master processor which will control the Viterbi chip (coprocessor) and will give the state to which 

corresponds to the maximum metric 1024 decimal when all other metric have value 0). 
o At every step n of the algorithm there exists information regarding a n being real or imaginary bit. Every 

two steps the complex values of s, may change. The Viterbi chip should store the path history of every 

state and after 15 elaborations, it will chose using majority logic the first of the last 15 bits of all the 32 

paths. The output should be the total number of O's or 1 's, whatever is the majority, otherwise the number 

of bits of the minority up to the 60th elaboration, 
o Next it has to select the likelihood path of the 8 (0 f 4,8,..28) states and to output the corresponding 15 

bits sequence of one of the 32 total states, 
o S| and z n are complex numbers with both parts of 8 bits (real and imaginary), but only the imaginary 

part of s,, s 3 , s 5 and the real part of s 2 and s 4 will be used (it can be proven from the computation of 

L 

Re(a n ]T Sl a nA ) 

7=1 

The proposed architecture for the integrated circuit is a Single Instruction Multiple Data (SIMD) one. The 
possibility for a systolic architecture was abandoned because of the nature of the algorithm (to proceed to the 
next step all results of the previous must be present) and the low yield (for choosing 32 states the yield is in 
the order of 6%). 

To execute the algorithm described above for 32 states, 16 processors are used. This array has been re- 
alised on the said integrated circuit with the following pins: 

□ 8 pins input/output data bus 

□ 3 pins input/output address bus 

□ RD and WR input pins for reading and writing respectively of the data from/to the integrated circuit 

□ IR input pin to define the mode of operation of the Viterbi processor (real or imaginary type) 

□ IP output pin, as a flag to signal if the processor is in a wait status (P=1) 

□ reset pin, which is the hardware reset of the system reset=1=active 

□ CE input pin to enable the integrated circuit (must always be CE=1) 

The processing elements are of two sorts which are identical except for their control signals that differ to 
allow each processor to perform a different function. Thus, there are 16 processors (Fig. 3), 8 "even" and 8 
"odd" ones (according the state they correspond to). Assigning to each state its decimal number, as only 16 
processors are used, each one computes two states. As the mapping was done by "folding" the first processor 
corresponds to states 0 and 31 and the last to 15 and 16. 

Each processor has a local memory (Fig. 4 and 5). Locally there are stored in a register file the correspond- 
ing survivor paths and the metrics of the states that each processor represent. Another difference between 
the elementary processors is that in the communication registers of the processors 0-7 may be stored data 
that reside either in the local registers or in an adjacent processor (local communications), while in processors 
8-15 only data from the upper processor element can be loaded (i.e. only shifting can be performed for the 
delivery of the data between the processors) (Fig. 4 and 5). The local communications that needs the algorithm 
are such that each processor must communicate with its predecessor and with its adjacent one (Fig. 3). 

Because of the local communications a dual processor was defined and an architecture with 8 identical 
dual processors implemented. These communicate only with the preceding arid the succeeding one. The dou- 
ble elementary processors are depicted in Figure 3, framed by a dashed line and divided at left and right part. 

The interface unit contains: 

□ address register (START-ADD, 6 bits) where the address of the ROM is stored where the first linecode 
starts; 

□ preamble register (PREAMB-REG, 5 bits) where are stored the first five bits of the transmitted se- 
quence; 

□ 2 8-bit registers to keep the last 15 bits of the estimated transmitted sequence; 
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□ 8-bit register 2-REG to record the samples for the sample to be processed. It should be noted that this 

register feeds a global bus that all other registers read (Fig. 6). 
The communication protocol is the following: 
the unit that co-operates with the Viterbi processor looks at the integrated circuit as an intermediate data bank 
5 to which it writes data and reads results. 

The write operation is asynchronous (Fig. 7) and has as follows (WR=1, RD=0, CE=1): 

1. minimum address setup time = 10nsec 

2. minimum data setup time » 3nsec 

3. minimum write pulse duration = 3nsec 

10 4. minimum address data hold time = 3nsec 

5. minimum data duration time = 3nsec 

and occurs at the following addresses: 

Address Operation 

000 received sample (Z value) 

15 001 coefficient S1 

010 coefficient S2 

011 coefficient S3 

100 coefficient S4 

101 coefficient S5 
20 110 Preamble 

111 start address of the subroutine and GO 

The integrated circuit performs two main subroutines: the last step subroutine and the main program sub- 
routine. The start addresses for these subroutines are: 
Main program = 00, Last step subroutine - 26. 
25 The timing to write data to addresses from 0 to 5 is depicted in Fig. 8 and is the same to the timing of writing 

to a register: 

1. Minimum address setup time = Snsec 

2. Maximum data propagation delay time = 1 Snsec 

3. Maximum data hold time after deactivation of read pulse = 3nsec 

30 4. Minimum address hold time after deactivation of read pulse = 0 nsec 

It should be noted that the address bus is decoded to provide its outputs as (activating) clock signals to 
the registers that store the mentioned values. 

To write to addresses 6 (110) and 7 (111) the duration of the write pulse must be at least equal to 2 periods 
of the clock since the processor should be synchronised with the master for its start of operation. The results 
35 are either of the main program (2 bits) or of the last step (16 bits). The latter are read in two phases as the 
bus is 8 bits wide. The addresses are as follows: 
RD=1, CE=1, WR=0 
Address OUTPUT 

□ 000 main program results 

40 - □ 001 last step results (8 Least Significant Bits) * " " 

□ 010 last step results (8 Most Significant Bits) 

The data-read timing is the same as in Fig. 8. 

It should be noted that, to activate any operation, the CE pin must be high. 

When the Viterbi processor is not in a wait state it cannot communicate with the master processor and is, 
45 thus, cut-off and not able to receive or submit data. 

The sum unit prepares the transition metrics that will be used by the processors. It contains 5 registers of 
8 bits each that store the coefficients of the channel (s1 to s5). 

These registers feed with data a couple of 12 bit accumulators which compute 8 transition coefficients 
each. The reason for using 2 accumulators in parallel is that the preparation time of the coefficients from one 
50 accumulator lasts twice as long as the main program that the processors run. 

There has been some "logic" added for the control so that by using the same control signals the operations 
of the second accumulator are appropriately modified. The data that are stored in the registers are loaded from 
the global input/output bus. The control of these registers is performed by six control lines that come directly 
from the address decoder in the interface unit The outputs that feed the processors are two twelve-bit buses. 
55 There are five control lines to this unit, which together with the value of the flag IR (which defines the mode 
of operation of the processor - real or imaginary) prepare the imaginary coefficients that will be used in the 
next step (when the processor operates in real mode) while in the current step the real coefficients are used 
that were computed in the previous step. This decreases the rate that the channel coefficients to s 5 can be 
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re-evaluated to one time every two steps of the algorithm. It is obvious that the operation of the two accumu- 
lators is modified because of the extra logic that has been assigned to them. In tables 1 and 2 the operation 
according to the control signals are described. 

One of the control bits is the register write enable, which is not included in the table. 

Table 1 refers to the selection of the operators, while table 2 to the type of operation. The contents of the 
Accumulator are called A, while the selected operand B. 

It should be noted that when one of the coefficients sO, s2 or s4 is selected, then line CNTL(4) is 1. Thus 
when the processor is in real mode (i.e. the imaginary mode transition coefficients are prepared) the control 
signals of the operation are alternated via a multiplexer like the addition A+B with the subtraction operation 
A-B and vice versa. As for the accumulator B, this always executes the opposite operations from A either during 
the real or the imaginary mode of operation , except when operator s4 is chosen. The commands "load selected 
operator" and "B-A" are executed simultaneously on both accumulators. The above described can be seen in 
detail in table 2. Also while in the description of the algorithm it seems that 64 coefficients are needed, here 
the sum unit computes only 1 6. The reason is that all possible combinations between the 5 channel coefficients 
are 32 coefficients with 16 being opposite to the others. Thus by having the 16 coefficients the operations 
and modifying the operations the same results are achieved. The coefficients that are serially produced are 
stored in a series of 1 6 registers, one in each processor and connected thus as to form a double chain of shift 
registers. In this way only local connections are kept between processing elements and the control lines for 
the write operation are minimised. After the end of the main program that the array of processors executes, 
the appropriate coefficients are stored at each processor. 

In Fig. 4 and 5 the block diagram of the single processing element ("even" and "odd" respectively) is de- 
picted. These two processors make up the "double" processing element. The double processors are divided 
into two parts, left and right respectively. 

Both contain two accumulators of 1 6 bits each and two inputs from which the operand to be accumulated 
is received in two's complement and in one's complement respectively. A third input feeds all the accumulators 
with a global coefficient that is the value of the sample to be examined each time (Z value). 

Finally four pins determine the behaviour of the accumulator as control pins. This behaviour, with respect 
to the control pins is described in table 3. As the functions required by the accumulator are less than 8, they 
could be represented by 3 pins. 

Thus these three signals come from the control unit and are decoded by a logic circuit (Fig. 9). The logic 
functions that are implemented, assuming as inputs a, b and c, are: zO=a.b, z1 =b.a, z2=a.b+a.b, z3=c (it should 
be noted that the mapping of the functions of the accumulator is not random but such that by a simple exchange 
of pins the opposite functions of the accumulator are performed). 

Each accumulator is controlled by the signals IN{1) and IN(0), their logic complements or the same switch- 
ed. 

If IN is the initial control signal and In the inverted, the mapping of the functions is such that each accu- 
mulator executes the function that corresponds for the evaluation of its corresponding state. When the global 
coefficient or the input operator is loaded then all accumulators execute the same function. 

Externally the accumulator has the same structure as the accumulator of the Sum Unit. The operator and 
its inverse are the inputs of a 2-input 1-output multiplexer, of which the output makes the one of the two inputs 
of a 1 6-bit fast two's complement adder. 

The second input of the adder is the output of the register which stores the result (or its inverse depending 
on the choice made by an internal multiplexer). The output of the adder is one of the inputs of a third multiplexer 
having as other inputs the input operator and the global coefficient. The output of this multiplexer is stored in 
the accumulator's register. 

In the case of the subtraction mode content-register the complement value of the register is selected and 
the carry input of the accumulator is set to logic 1. 

In case of overflow two cases are examined: positive and negative overflow. In both the contents of the 
register are maximised, positively and negatively respectively. 

The reason is that in the specific algorithm the result is not important as a number but simply as a value 
to be compared to the other values. Care must be taken in the synchronisation of the pre-setting of the register 
in the maximum value, positively or negatively. During the time that the adder is preparing the result, latches 
are watching for any overflow. However, when the register loads the result, then in case of a positive overflow 
there is asynchronous pre-setting of the register to the value 7FFF, while in a negative overflow to the value 
8000. It is necessary to synchronise the pre-setting with the clock because during calculations there may be 
spikes. The clock of the accumulator is not the same with the global distributed clock but the logical AND be- 
tween it and the signal of the control unit for the activation of the register of the accumulator. 

The output of the two accumulators of the processing element is the input for a magnitude comparator of 
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two's complement numbers. 

The output of this comparator controls a multiplexer to select the maximum arithmetic value as output of 
the unit. 

The input to the accumulator is provided by the registers AC_REG which are buffers so to move data in 
5 parallel within the array of the processors, while the accumulators prepare a result. 

Other registers store the values of the metrics of the algorithm which have been chosen by the comparator. 
Two more registers store at every step of the algorithm the survivors. These are shift registers and besides 
being able to load a new value they are able to shift their content to the left, while the value assigned to the 
least significant bit depends on the state that is represented by the specific register (even=0, odd=1). 
10 The Real and Imag registers store the constants that the sum unit calculates which relate to the channel 

response through which the data are transmitted. Instead of the registers, u.RAMs could be used but this would 
add extra decoders to the circuit for each RAM and increase of the access time, while not allowing the parallel 
loading of the registers Real or Imag with values corresponding to the new channel response. 

The same registers exist in the right part of the double processor as well as the same arithmetic unit. 
15 There are finally two communication registers which are the key point of this architecture. In register A 

any coefficient or variable of the left part can be stored, while in register B the same but for any of the right 
part. Additionally both of these registers can load the contents of the respective register of the previous proc- 
essing element to which they are connected. The communication registers of the right part can load only the 
content of the previous processor of the respective registers (A or B). By this means of communications, a reg- 
20 ister chain is formed which makes up a Shift-Register. In this way the data can be transmitted from any proc- 
essor to any other. This communications scheme was preferred over the direct communications because of 
the large area that direct interconnections would occupy on the chip and the complex communications as each 
register needs to be connected to more than one registers in completely different areas of the circuit. Thus 
direct communications would occupy the greater part of the circuit 
25 According to the Viterbi algorithm, from each survivor after the 15th step, the number of ones (1) has to 

be checked if all are 0 or 1 or else the majority has to be calculated. This is achieved by a tree-adder which 
in each processor consists of one-bit half adder. The two-bit outputs of these two at every processing element 
feed a third 2-bit adder without carry-in. This is the partial result of each processor giving the number of ones 
for 4 survivors. The tree-adder is ending at a five-bit adder having as output 6 bits (which is the sum of the 1 
30 from all survivors); according to some additional logic this produces a 2-bit code signalling the decision for the 
transmitted bit before 15 steps. Table 4 shows the correspondence. 

When ad =32 the Z1 ,22=1 1 when ad=0, 21 ,22=00 if ad<=1 6 Z1 ,22=01 if ad>=1 6 21 ,22=1 0. There are also 
other solutions to give Z1 t 22 like a counter or a logic circuit to perform a logic function having as inputs the 
32 bits of the survivors and output 21 ,22. The reason the tree adder was chosen is that it has less hardware 
35 and is faster compared to the other solutions. 

The two bits 21 , 22 are the result of the array of the processors each time it is called to calculate data and 
to estimate the value of a sample. However the value of 21 , 22 is not valid for the first 1 5 calls of the processor. 
Only after the 15th call the information is considered valid since only by then is ensured that the algorithm al- 
ready has converged and there would be a common result from all 32 survivors, i.e. GO or 11 (common path). 
40 Of course the read result after the 15th call, i.e. at the end of the 16th call, will relate to the value that was 
received by the processor during the 1st call. This is achieved by the 16-bit shift register that stored the survivor 
after each call and shifts its previous content by one position to the left, while the information is written to the 
right at the Least Significant Bit. Thus, for the result to be able to reach to the left (most significant bit) there 
have 16 calls to be made. To cover this delay one of the survivors will be given as a final result for the last 15 
45 bits to the main processor that calls the Viterbi. 

Through a process, to be explained in the next paragraph and is concerned with the data flow, the maxi- 
mum likelihood survivor from the 32 is stored in the communication register of the right part of the 8th (last) 
double processor. The 1 5 bits are written in 2 registers (A-LSB-PTH and A-MSB-PTH) of 8 bits each to be able 
to feed the input/output bus after the call of the last step (Fig. 6). Then a multiplexer (MX-F-O) controlled by 
so the address pins, outputs the first or the second part of the final survivor. This multiplexer is also the one that 
leads to the output the results 21, 22 during the normal calls of the integrated circuit in question. 

Finally in each double processor there is a 1-bit register whose input is connected to the output of the pre- 
vious while the first in the row has as input a logic 0. Thus an 8-bit shift register is formed which is used as a 
mask register for the moving of data. 
55 This shifter has asynchronous setting while its dock is controlled by the same logic that controls the clock 

of the communication registers A. 

Thus if the pre-setting pin is set, all are set to 1 . Then after the circuit operates for 8 clock pulses the results 
of Table 5 appear. 
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Many inverters have been used to buffer the lines, which due to the Single Instruction Multiple Data ar- 
chitecture have to feed a large number of inputs. Thus in the circuit each output has a maximum fan-out of 4. 

In the metrics registers there are two additional inputs which are needed for their pre-setting before the 
first call (by the master) of the chip. 
5 The first 5 bits that are transmitted are called preamble and are loaded in a 5-bit register (PREAMB-REG) 

(Fig. 6), before any data processing starts. This register is located in the interface unit and loads this value 
that is provided by the master at address ad=110. 

Simultaneously with the writing, at the address 110 is enabled the JK Flip-Flop that keeps the mode of 
operation of the Viterbi Processor. The inputs of the JK Flip-Flop are given from a 2 bits multiplexer. The inputs 
10 of the latter are 1 bit from the Control Unit duplicated as the first choice and the external input IR and its com- 
plementary is the other choice. 

When no write operation takes place at address 110, the inputs of the JK are the duplicated wire from the 
Control Unit and functions as a T Flip-Flop. When the J and K inputs are low the JK keeps the previous value, 
if is high it toggles to the complementary value. 
15 If data is written to address 110 then it is written the value applied to external pin IR. Simultaneously the 

preamble is written and after a delay of a clock cycle the metric registers are preset to their initial values. This 
circuit is a couple of Flip-Flops, and a latch. The delay is due to ensure that the right data is written to the pre- 
amble register. 

The metrics registers are 15 Flip-Flops with an asynchronous clear input, (the 10th Flip-Flop that stores 
20 the 10th bit of a metric has an additional preset input). The clear inputs are activated during the writing to ad- 
dress 110, when the preset from the 32 bits "active low decoder with enable" (AD-DE, Figure 6). 

This decoder when the enable is low all outputs are set at 1 . Its input is the output of the preamble register. 
When the enable is active only one output that corresponds to the value of its input goes at 0. 

The metric register connected to that output is preset to the decimal value 1024 (since the 10th bit is 1 
25 and all others 0s). It is prevented with a NAND gate to have active both reset and preset inputs. After writing 
to address 110 all metric registers have value 0 and only one the value 1024. 

It is ensured that first the preamble register is written and only when its output is stable the decoder's en- 
able is activated to reset and preset accurately the Flip-Flops of the metric registers. 

The registers Real Imaginary are enabled by the Sum Unit by a wire of the Control Unit that in combination 
30 with the operation mode of the processor IR (Real or imaginary), activate for writing the proper chain of reg- 
isters. The logic functions CIR and CIR, are the enable for the imaginary and real registers respectively (Fig- 
ure 10). 

The control Unit consists of a Read Only Memory (128 words, 30 bits wide ROM), (Figure 11), with an 
enable pin (if it is high, the value from the input bus "address(5:0)'\ is latched and when it is low, outputs the 

35 data at bus OUT(30:1) after a certain delay T_delROM =24ns for the 1.0 standard cell technology used for 
the implementation of that chip. 

The address of the ROM is generated by a presetable linear feedback counter of 7 bits with a lock input. 
It consists of 6 D Flip-Flops connected in series and a JK Flip-Flop that is the Least Significant Bit (LSB). The 
complementary output of this JK FF is the input of the Most Significant Bit (MSB) D Flip-Flop. 

40 All Flip-Flops have asynchronous set and reset inputs. The states through which the counter goes is a stan- 
dard sequence and not ascending numbers. The state FF(HEX) is not accessible and only activating the lock 
pin of the counter or pre-setting the counter to that value is possible to access it. The counter is possible to 
escape from that "trap" state (wait status) only by pre-setting it to another value and the it follows its normal 
state flow. 

45 The wait status is used by the Viterbi processor when it is idle and waits the master to give data for proc- 
essing. When this happens actually the preset pin "prst" of the program counter is set to 1 and the value at 
the bus start_address (connected to the preset input bus of the counter), is written asynchronously. This asyn- 
chronously writing is achieved by simple logic which activates the set or reset input accordingly to if it is 1 or 
0 at the wires of the preset input bus. 

50 The data written to the program counter is the content of the start_reg register of the interface Unit and 
corresponds to the address of the ROM where the first word of the code to be executed is located. The startj-eg 
register is written asynchronously and in the address 111 by the master processor, and after a certain delay 
the prst pin of the program counter is activated to latch the starting address of the routine. The writing is done 
according to the timing restrictions (setup and hold of set and reset, Figure 12, t1 = hold for set and reset). 

55 This synchronisation is achieved by the synchroniser of Figure 13 which functions as it is described in fig- 

ure 14. There are three Flip-Flops (DF_1 ( DF_3 and DF_5). 

The two first are clocked by the inverse of the Global clock CK when the third one DF__5 by the normal 
dock. The gate NAND2 that has inputs the output of the DF__1 and DF_3 feeds the input of DF_5. 
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When a rising edge arrives (when is activated) then with a delay at most of one clock cycle a positive pulse 
of a maximum duration T/2 appears at the output Q5. So it is ensured that during the asynchronous preset of 
the program counter alt the timing restrictions will be respected. Also during the preset of the metrics registers 
it is foreseen to have the output of the preamble_reg stable because the preset of the metric registers is done 

5 from T/2 to 3T/2 before the pulse prst appears. 

After the start address is written the program counter starts generating the addresses of the program code 
until it finds the code line which is the return command which returns the program counter to the wait status 
and the processor remains idle. This is achieved by activating the input of asynchronous preset of the counter 
to the state "all ones (FF)". The LSB bit of the data of the ROM with the hardware reset of the entire chip are 

10 inputs to a NOR2 gate the output of which is connected to the set input of the program counter. Any of both 
when they are high the program counter is set to FF. 

The control Unit has also another counter that is used to adjust the time-length of every code line. Every 
command is characterised from its "time-length" that is from 1 to 4 maximum clock cycles. A couple of bits 
into the ROM store this information that at every linecode presets asynchronously the down counter 

15 COMMJDELAY. 

Its circuitry is described into Figure 1 5. It consists of a D-Flip-Flop and a JK Flip-Flop, the complementary 
output of which is the input of the D-FF. Both have asynchronous set and reset inputs (set, reset). There is a 
dedicated logic to avoid synchronous activation of both inputs. This might happen because these inputs are 
direct from the ROM without first to be latched (they correspond to the "time-length" of the current command). 
20 The hardware reset, presets comm_delay to 11. 

The output of COMM-DELAY controls through gates the clock of the program counter and puts it to the 
next state as soon as the number of the pulses and counts the COMM_DELAY is the time length of the com- 
mand. 

Proper delays of T/2 or T are realised using latches or Flip-Flops respectively to preset asynchronously 
25 the comm_delay and to load the command register. 

The subroutine of the normal step of the Viterbi algorithm some operations could take longer than a period 
(comparison and selection or addition). In these cases the time length of the command may be increased so 
to achieve the correct operation. 

This is an advantage to the circuit because the control ROM clock could be increased and avoid the critical 
30 paths of the complex parts of the circuit to limit the maximum operating frequency. Also in case that a command 
should be repeated in consecutive clock cycles we just put the number of repetitions on the time length of the 
command and using just one word of the ROM to have multiple execution (Shift operation in the communication 
registers). The address of each command at the output of the program counter appears one clock before its 
loaded to the command register the proper content of the ROM. The 30 pins of the ROM, the 11, 7 and 26 (they 
35 control the loading of the ReaMmag registers and the accumulators of the Sum Unit and of the Processing 
elements), are activated during the last pulse of a command of duration more than one clock cycle. 

There follow the Tables that refer to the above text and help to the understanding of the current embodiment 
of the invention. 
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ADDRESS CNTL(4:2) 


SELECTED VARIABLE " 




000 


Constant 0 




001 


A/2 


45 


010 


S1 




011 


S3 




100 


SO 




101 


S2 


50 








110 


S4 




111 


S4 


55 


Table 1: Addresses of selection for the variables of the Ac- 
cumulator In Sum Unit. 
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Load of the 
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A A + B 


Aa-B 
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A A -B 


Ab + B 


01 


X 
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1 


A A -B 


Ab - B when S 4 is selected 


10 
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1 


1 


A A + B 


Ab + B 


01 


0 


1 


0 


A A - B 


Ab + B when S 0 or S 2 is selected 


10 


0 


1 


0 


A A + B 


Ab-B 


Table 2: Control signals and corresponding operations at the Accumulators of the Processing elements. 



INPUT 
IN(20) 


OUTPUT 
OUT(3:0) 


OPERATION 


000 
001 


ooxx 

01XX 


Load of the "global" 
coefficient 


010 


01XX 


Load of the input register 


Oil 


OOXX 


Load of the input register 


100 


1X00 


Load of the "global" 
coefficient 


101 


1X10 


Accumulator -f Input 


110 


1X00 


Accumulator - Input 


111 


1X01 


Accumulator + Input 
Input - Accumulator 



Table 3: Control Signals of the Processing elements Accumulators 
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ADDER RESULTS 




ad(5)+ad(4)-ad(0) 
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00 
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X 


11 


X 


X 
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0 


0 
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Table 4 
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a(7:0) 


1 


11111111 


2 


01111111 


3 


00111111 


4 


00011111 


5 


00001111 


6 


00000111 


7 


00000011 


8 


00000001 


Table 5: Data 
Flow at the 
mask register. 



Claims 

1. Integrated circuit implemented on the surface of a semiconductor material having a parallel architecture 
of a Single Instruction Multiple Data type, that accepts in Real time, transmitted sequences of Data, runs 
the modified Viterbi algorithm, estimates the maximum likelihood transmitted sequence using the infor- 
mation of the transmission channel response and is characterised by: 

Central Control Unit comprising a non volatile memory which keeps the algorithm code, program counter 
for address generation, counter to adjust the time length of a command and other glue logic. 
Interface Unit that communicates asynchronously or synchronously to receive data necessary for the exe- 
cution of the algorithm, and to output the results of the processing as well as for the initialisation of the 
claimed integrated circuit. 

on chip coprocessor that calculates indermediate coefficients to be used by the Viterbi algorithm using 
the information of the transmission channel response, 

A plurality (at least two) of processing elements arranged on the semiconductor material, forming an array 
and running in parallel according to the instructions that feeds the said Control Unit, they exchange data 
locally or even not (every one with its next and/or its previous and/or its besides according the array struc- 
ture),, every processing element embodies accumulators, comparators, multiplexers, registers, and/or 
other glue logic components. 

2. The said integrated circuit of claim 1 function assuming that in the transmission channel the data inter- 
ference is among 2 to 10 consecutive digital data sequence samples. 

3. The said integrated circuit of claim 1 , 2 characterised by it requires from 2 to 25 samples of the transmitted 
data sequence to output the first result concerning the first bit of the said data sequence. 
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The said integrated circuit of claim 1 , 2 and 3 characterised by it may use any combination of the choices 
mentioned at claims 2 and 3. 

Integrated circuit implemented on the surface of a semiconductor material having a parallel architecture 
of a Single Instruction Multiple Data type, that accepts in Real time, transmitted sequences of Data, coded 
with convolutional codes, runs the modified Viterbi algorithm, estimates the maximum likelihood trans- 
mitted sequence using the information of the transmission channel response, then executes the Viterbi 
algorithm and decodes the data using the information about the code which encoded the data, and is char- 
acterised by: 

Central Control Unit including the algorithm code in a non volatile memory, program counter for address 
generation, counter to adjust the time length of a command and other glue logic. 
Interface Unit that communicates asynchronously or synchronously to receive data necessary for the exe- 
cution of the algorithm, and to output the results of the processing as well as for the initialisation of the 
claimed integrated circuit. 

on chip coprocessor that calculates the intermediate coefficients to be used by the Viterbi algorithm using 
the information of the transmission channel response, 

A plurality (at least two) of processing elements arranged on the semiconductor material by means of an 
array or random and running in parallel according to the instructions that feeds the said. Control .Unit,. they 
exchange data locally or even not (every one with its next and/or its previous and/or its besides according 
the array structure),, every processing element embodies accumulators, comparators, multiplexers, reg- 
isters, and/or other glue logic components. 

The said integrated circuit of claim 5 function assuming that in the transmission channel the data inter- 
ference is among 2 to 1 0 consecutive digital data sequence samples. 

The said integrated circuit of claim 5, 6 characterised by it requires from 2 to 25 samples of the transmitted 
data sequence to output the first result concerning the first bit of the said data sequence. 

The said integrated circuit of claim 5, 6 and 7 characterised by it may use any combination of the choices 
mentioned at claims 2 and 3. 
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(57) In modern data transmission systems the main 
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that determines the real transmitted sequence, by ob- 
serving the data for a long time interval. 

The Viterbi algorithm can be used not only for the 
decoding of convolutional code sequences but also for 
the estimation of the maximum likelihood sequence of 
data-distorted by ISI. 

The current patent application concerns an integrat- 
ed circuit that executes a properly modified Viterbi algo- 
rithm using parallel architecture. Implementations of the 
Viterbi algorithm using commercial DSPs or ASICs have 
been reported in the past. However they do not employ 
a parallel architecture and thus, require much longer 
processing times. 

The main use of the proposed invention is in the 
Maximum Likelihood Sequence Estimation (MLSE) of 
digital data sequences, distorted by noise attributed to 
ISI. The same architecture, however, with minor modi- 
fications, can also be used to decode digital data coded 
with convolutional codes. 
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